Overview

The goal of this analysis is to take publicly available retinal single bipolar cell sequencing data and correctly identify the 11 known bipolar cell types using R, the R package Seurat, and known applicable cell markers. The data to be analyzed will be the 25,000 retinal bipolar cells from the 2017 August paper, “COMPREHENSIVE CLASSIFICATION OF RETINAL BIPOLAR NEURONS BY SINGLE-CELL TRANSCRIPTOMICS” by Shekhar et al. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5003425/) data publicly available at https://portals.broadinstitute.org/single_cell/study/retinal-bipolar-neuron-drop-seq. The proper identification of these cells will lead to a greater understanding of the roles each cell type preforms in the eye and could also help to identify novel gene markers for the cell types analyzed. These potential novel markers may help future studies successfully categorize retinal cells.

To properly understand and manipulate the data three UPenn faculty members were consulted. Dr. Dwight Stambolian, from the department of Ophthalmology and Human Genetics, was consulted to provided ideas and information on how the result of the analysis should appear and which graphics would be most useful to show any results. Dr. Mingyao Li, from the department of Biostatistics and Epidemiology was consulted to gain a better understanding of single cell sequencing data and what programs would be best suited to analyze such data. Dr. Randy J. Zauhar, Professor of Chemistry and Biochemistry, was able to give instruction on how to check the quality of a sequencing run. With the guidance provided by the above listed persons, the direction of the project was made clear.

Final GitHub Repository: https://github.com/NicholasDana/BMIN503_Final_Project

Introduction

The ability to correctly identify cell types plays a critical role when interpreting single cell sequencing data. Single cell sequencing grants insight into the role each cell may play in a cell population based on the genes expressed within each individual cell. As opposed to tissue sequencing which is limited to providing information between two distinct tissue types, such as a control tissue and a diseased tissue, single cell sequencing can provide answers to which genes in which cell are responsible for a manifested behavior. Due to this increased resolution of sequencing data, it has become critical to identify which cells are which and establish control expression profiles for cell types of interest.

The issue presented by this project requires knowledge on; how to interact with and manipulate single cell sequencing data, known biological markers of human retinal bipolar cells, how to preform and interpret statistical queries, and the role observed expressed genes preform. All these required disciplines make solving this issue of cell type identification a complicated process which requires continued re-analysis and interpretations with multiple cell populations. The guidance from the previously mentioned UPenn faculty members will hopefully allow this project to successfully identify cell types and possibly novel cell markers.

Methods

The data to be analyzed is the sequencing data from the 25,000 retinal bipolar cells. The data comes in the form of an expression matrix, listing the expression data from each of the 25,000 cells for each gene observed. The data is stated to already be medium normalized and log transformed. Originally the data was analyzed by Shekhar et al. in R with thier own disclosed code and function, however this analysis will utilize the R package Seurat which was created specifically to interpret single cell data.

library(Seurat)
library(dplyr)

Quality Control

The analysis starts with the creation of a Seurat object, which needs to recieve the expression data matrix. The Seurat object has numerous meta data and functions which are generated from the data upon creation. By plotting the number of genes and number of unique molecular identifiers (UMI) we can better understand the distribution of cells and use this information to apply potential filters. We can also look at the ident of the Seurat object to list all the cells and see any cell groupings that may be present in the data.

setwd("~/BMIN503_Final_Project/")
shekhar.data=read.table(paste0("Retinal Bipolar Neurons/exp_matrix.txt"),header = TRUE,row.names = 1)
shekhar=CreateSeuratObject(raw.data = shekhar.data)
head(shekhar@ident)
## Bipolar1_CCCACAAGACTA Bipolar1_TCGCCTCGTAAG Bipolar1_CAAAGCATTTGC 
##              Bipolar1              Bipolar1              Bipolar1 
## Bipolar1_CTTTTGATTGAC Bipolar1_GCTCCAATGACA Bipolar1_AAATACCCTCAT 
##              Bipolar1              Bipolar1              Bipolar1 
## Levels: Bipolar1 Bipolar2 Bipolar3 Bipolar4 Bipolar5 Bipolar6
table(shekhar@ident)
## 
## Bipolar1 Bipolar2 Bipolar3 Bipolar4 Bipolar5 Bipolar6 
##     3055     3419     3662     3851     7129     6383
VlnPlot(object = shekhar,features.plot = c("nGene"))

VlnPlot(object = shekhar,features.plot = c("nUMI"))

We can see that there are 6 groupings of cells unevenly distributed, possibly indicating the cells were sequenced in 6 batches, Bipolar1 through Bipolar6. Looking at the number of genes and unique molecular identifiers we can see some outlyer cells which may be cell doubets, two different cells which were sequenced together as a single cell, that we may decide to filter out. Each point on the graph represents an individual cell.

The mitochondial genes are also commonly used to filter out unwanted cells. By identifying all expressed mitochondrial genes contained in the expression matrix, and then comparing the percentage of mitochondrial expression to all other genes, we can determine how much of a cell is composed of mitochondrial genes.

shekhar.mito.genes=grep(pattern = "mt-",x = rownames(x = shekhar@raw.data),value = TRUE)
shekhar.mito.genes
##  [1] "mt-Co1"  "mt-Cytb" "mt-Nd1"  "mt-Nd2"  "mt-Nd4"  "mt-Nd5"  "mt-Nd6" 
##  [8] "mt-Rnr1" "mt-Rnr2" "mt-Ta"   "mt-Tc"   "mt-Ti"   "mt-Tm"   "mt-Tp"  
## [15] "mt-Tq"   "mt-Ts2"  "mt-Tt"   "mt-Tv"   "mt-Tw"   "mt-Ty"
length(shekhar.mito.genes)
## [1] 20
shekhar.mito=colSums(shekhar@raw.data[shekhar.mito.genes,])/colSums(shekhar@raw.data)
shekhar=AddMetaData(object = shekhar,metadata = shekhar.mito,col.name = "percent.mito")
VlnPlot(object = shekhar,features.plot = c("percent.mito"))

There are a total of 20 mitochondrial genes listed above. Based on the distribution of cells by mitochondrial percentage, it is observed that the bulk of cells tend to have around 12-15% mitochondrial content. It may be a good idea to filter out cells with more than 30% mitochondrial content as a means to reduce the number of cell doublets in the data, unless it is anticipated that a rare cell type may have high mitochondrial content. To better understand the data quality we can plot cells by multiple variables, and observe any obvious outlyers to help set potential filter levels.

GenePlot(object = shekhar,gene1 = "nUMI",gene2 = "percent.mito")

GenePlot(object = shekhar,gene1="nUMI",gene2 = "nGene")

GenePlot(object = shekhar,gene1 = "percent.mito",gene2 = "nGene")

Using the above information as a qualitative measure of the data, cell filters were set for the number of genes, the number of unique molecular identifiers, and the percentage of mitochondrial content. Any cell with more than 3000 expressed genes, 30% mitochondrial gene content, or 950 unique molecular identifiers were filtered out of the data. To see the effect of these filters, the previous plots will be reproduced on the filtered data.

shekhar=FilterCells(object = shekhar,subset.names = c("nGene","percent.mito","nUMI"),low.thresholds = c(0,-Inf,0),high.thresholds = c(3000,0.3,950))
VlnPlot(object = shekhar,features.plot = c("nGene","nUMI","percent.mito"),nCol = 3,ident.include = shekhar@ident)

par(mfrow=c(1,3))
GenePlot(object = shekhar,gene1 = "nUMI",gene2 = "percent.mito")
GenePlot(object = shekhar,gene1="nUMI",gene2 = "nGene")
GenePlot(object = shekhar,gene1 = "percent.mito",gene2 = "nGene")

Genes that show high variablity between the cells are now identified by calculating the average expression and dispersion for each gene, a list of these genes is stored within the Seurat object @var.genes. These genes will be important as a good biological marker should show relatively high expression in only the cell type of interest. We would expect a good marker to have a low average expression and high variability across all cells, while having a high average expression and low variability in the cells it is expected to mark. After looking the mean variability plot, the cutoff parameters were set to mark visual outlyers on the plot. With the current parameters there are 347 variable genes identified, which could be potential cell markers.

shekhar=FindVariableGenes(object = shekhar,mean.function = ExpMean,dispersion.function = LogVMR,x.low.cutoff = 0.0125,x.high.cutoff = 1.75,y.cutoff = 1.5)

str(shekhar@var.genes)
##  chr [1:347] "2700049A03Rik" "2810474O19Rik" "2900052N01Rik" ...

The dataset may also contain variation from unintresting sources, such as batch effects and biological information such as mitochondrial expression. Removing these sources of variation can allow for greater accuracy in determining intresting cell type gene expression variation. This is accomplished by constructing linear models to predict and scale the gene expression based only on the variables we find interesting. The effect of the number of molecular identifiers and mitochondrial content will be mitigated from the dataset. Normally the data would be normalized before scaling the data, however the data was stated to be medium normalized and log transformed, so this step was omitted.

shekhar=ScaleData(object = shekhar,vars.to.regress = c("nUMI","percent.mito"))
## NormalizeData has not been run, therefore ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## [1] "Regressing out nUMI"         "Regressing out percent.mito"
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |                                                                 |   1%
  |                                                                       
  |=                                                                |   2%
  |                                                                       
  |==                                                               |   3%
  |                                                                       
  |==                                                               |   4%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |====                                                             |   6%
  |                                                                       
  |====                                                             |   7%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |======                                                           |  10%
  |                                                                       
  |=======                                                          |  11%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |========                                                         |  13%
  |                                                                       
  |=========                                                        |  14%
  |                                                                       
  |==========                                                       |  15%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |===========                                                      |  17%
  |                                                                       
  |============                                                     |  18%
  |                                                                       
  |============                                                     |  19%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |==============                                                   |  21%
  |                                                                       
  |==============                                                   |  22%
  |                                                                       
  |===============                                                  |  23%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |================                                                 |  25%
  |                                                                       
  |=================                                                |  26%
  |                                                                       
  |=================                                                |  27%
  |                                                                       
  |==================                                               |  27%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |===================                                              |  29%
  |                                                                       
  |===================                                              |  30%
  |                                                                       
  |====================                                             |  30%
  |                                                                       
  |====================                                             |  31%
  |                                                                       
  |=====================                                            |  32%
  |                                                                       
  |=====================                                            |  33%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |======================                                           |  34%
  |                                                                       
  |=======================                                          |  35%
  |                                                                       
  |=======================                                          |  36%
  |                                                                       
  |========================                                         |  36%
  |                                                                       
  |========================                                         |  37%
  |                                                                       
  |=========================                                        |  38%
  |                                                                       
  |=========================                                        |  39%
  |                                                                       
  |==========================                                       |  39%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |===========================                                      |  41%
  |                                                                       
  |===========================                                      |  42%
  |                                                                       
  |============================                                     |  42%
  |                                                                       
  |============================                                     |  43%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |=============================                                    |  45%
  |                                                                       
  |==============================                                   |  45%
  |                                                                       
  |==============================                                   |  46%
  |                                                                       
  |===============================                                  |  47%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |================================                                 |  48%
  |                                                                       
  |================================                                 |  49%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================                                |  51%
  |                                                                       
  |=================================                                |  52%
  |                                                                       
  |==================================                               |  52%
  |                                                                       
  |==================================                               |  53%
  |                                                                       
  |===================================                              |  54%
  |                                                                       
  |===================================                              |  55%
  |                                                                       
  |====================================                             |  55%
  |                                                                       
  |====================================                             |  56%
  |                                                                       
  |=====================================                            |  57%
  |                                                                       
  |=====================================                            |  58%
  |                                                                       
  |======================================                           |  58%
  |                                                                       
  |======================================                           |  59%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |=======================================                          |  61%
  |                                                                       
  |========================================                         |  61%
  |                                                                       
  |========================================                         |  62%
  |                                                                       
  |=========================================                        |  63%
  |                                                                       
  |=========================================                        |  64%
  |                                                                       
  |==========================================                       |  64%
  |                                                                       
  |==========================================                       |  65%
  |                                                                       
  |===========================================                      |  66%
  |                                                                       
  |===========================================                      |  67%
  |                                                                       
  |============================================                     |  67%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |=============================================                    |  69%
  |                                                                       
  |=============================================                    |  70%
  |                                                                       
  |==============================================                   |  70%
  |                                                                       
  |==============================================                   |  71%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |===============================================                  |  73%
  |                                                                       
  |================================================                 |  73%
  |                                                                       
  |================================================                 |  74%
  |                                                                       
  |=================================================                |  75%
  |                                                                       
  |=================================================                |  76%
  |                                                                       
  |==================================================               |  77%
  |                                                                       
  |===================================================              |  78%
  |                                                                       
  |===================================================              |  79%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=====================================================            |  81%
  |                                                                       
  |=====================================================            |  82%
  |                                                                       
  |======================================================           |  83%
  |                                                                       
  |=======================================================          |  84%
  |                                                                       
  |=======================================================          |  85%
  |                                                                       
  |========================================================         |  86%
  |                                                                       
  |=========================================================        |  87%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |==========================================================       |  89%
  |                                                                       
  |===========================================================      |  90%
  |                                                                       
  |===========================================================      |  91%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |=============================================================    |  93%
  |                                                                       
  |=============================================================    |  94%
  |                                                                       
  |==============================================================   |  95%
  |                                                                       
  |===============================================================  |  96%
  |                                                                       
  |===============================================================  |  97%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |=================================================================|  99%
  |                                                                       
  |=================================================================| 100%
## [1] "Scaling data matrix"
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=====                                                            |   7%
  |                                                                       
  |=========                                                        |  14%
  |                                                                       
  |==============                                                   |  21%
  |                                                                       
  |===================                                              |  29%
  |                                                                       
  |=======================                                          |  36%
  |                                                                       
  |============================                                     |  43%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=====================================                            |  57%
  |                                                                       
  |==========================================                       |  64%
  |                                                                       
  |==============================================                   |  71%
  |                                                                       
  |===================================================              |  79%
  |                                                                       
  |========================================================         |  86%
  |                                                                       
  |============================================================     |  93%
  |                                                                       
  |=================================================================| 100%

Principle Component Analysis

Principle component analysis will be preformed to better understand the variance in this multivariate dataset. A total of 30 principle components will be calculated for the scaled dataset using the previous 347 highly variable genes. Each PC calculated will effectively represent an extra layer of data for a set of correlated genes from the variable genes list. The strength of the correlated genes included in the analysis will ultimately decide how the cells are eventually clustered together, so determining how many PCs to include will have a large effect on the data.

shekhar=RunPCA(object = shekhar,pc.genes = shekhar@var.genes,do.print = FALSE, pcs.print = 1:5,genes.print = 5,pcs.compute = 30)

The principle components of the analysis will be visualized and reviewed using 4 methods. First by Viewing the cells and genes that define each principle component.

PrintPCA(object = shekhar,pcs.print = 1:15, genes.print=10, use.full = FALSE)
## [1] "PC1"
##  [1] "Scgn"    "Gria2"   "Gngt2"   "Lhx4"    "Gngt1"   "Gsg1"    "Cabp2"  
##  [8] "Pcp4l1"  "Vsx1"    "Fam19a3"
## [1] ""
##  [1] "Apoe"   "Dkk3"   "Rlbp1"  "Glul"   "Slc1a3" "Sparc"  "Mfge8" 
##  [8] "Clu"    "Car14"  "Cd9"   
## [1] ""
## [1] ""
## [1] "PC2"
##  [1] "Mt1"     "Hopx"    "Cep112"  "Igfn1"   "Wbscr17" "Zfml"    "Col1a2" 
##  [8] "Chodl"   "Fn1"     "Vax2os" 
## [1] ""
##  [1] "Sag"   "Rho"   "Pdc"   "Pde6g" "Gnat1" "Rcvrn" "Rp1"   "Gnb1" 
##  [9] "Rs1"   "Pde6b"
## [1] ""
## [1] ""
## [1] "PC3"
##  [1] "Fam19a3"       "Scgn"          "Grik1"         "Slitrk6"      
##  [5] "A730046J19Rik" "Fezf2"         "Gsg1"          "Otor"         
##  [9] "Slit2"         "Esam"         
## [1] ""
##  [1] "Mt1"    "Cep112" "Prph2"  "Igfn1"  "Cplx4"  "Rho"    "Vax2os"
##  [8] "Hopx"   "Gnat1"  "Pdc"   
## [1] ""
## [1] ""
## [1] "PC4"
##  [1] "Zfhx4"  "Cdh8"   "Pcp4l1" "Tnnt1"  "Tacr3"  "Nxph1"  "Neto1" 
##  [8] "Vsx1"   "Slit2"  "Lamp5" 
## [1] ""
##  [1] "Gsg1"    "Lhx4"    "Nnat"    "Meis2"   "Gngt1"   "Nfia"    "Otor"   
##  [8] "Prkar2b" "Neurod2" "Grik1"  
## [1] ""
## [1] ""
## [1] "PC5"
##  [1] "Fezf2"         "Cep112"        "A930003A15Rik" "Fam19a3"      
##  [5] "Mt1"           "Otor"          "Rpgrip1"       "Wfdc2"        
##  [9] "Hopx"          "Grik1"        
## [1] ""
##  [1] "Gjd2"    "Vsx1"    "Gria2"   "Cplx4"   "Cdh9"    "Cdh8"    "Cck"    
##  [8] "Nfia"    "Neurod2" "Pcp4l1" 
## [1] ""
## [1] ""
## [1] "PC6"
##  [1] "Cck"    "Gngt2"  "Lect1"  "Gnat2"  "Unc13c" "Scgn"   "Tnnt1" 
##  [8] "Pcp4l1" "Spock3" "Cabp2" 
## [1] ""
##  [1] "Tacr3"  "Nxph1"  "Ncam2"  "Wls"    "Nfia"   "Pcdh10" "Slit2" 
##  [8] "Pde1a"  "Meis2"  "Cdh9"  
## [1] ""
## [1] ""
## [1] "PC7"
##  [1] "Igfn1"   "Vsx1"    "Pcdh17"  "Cabp2"   "Fn1"     "Snhg11"  "Gnat3"  
##  [8] "Sox5"    "Rpgrip1" "Wfdc2"  
## [1] ""
##  [1] "Bhlhe22" "Lhx4"    "Lamp5"   "Neto1"   "Gsg1"    "Chrnb3"  "Ebf1"   
##  [8] "Chrna6"  "Cdh8"    "Gngt1"  
## [1] ""
## [1] ""
## [1] "PC8"
##  [1] "Cd9"      "Abca8a"   "Vim"      "Slitrk2"  "Enpp2"    "Kdr"     
##  [7] "Pon2"     "Kcnj10"   "Aqp4"     "Cacna2d2"
## [1] ""
##  [1] "Fosb"  "Junb"  "Ier2"  "Btg2"  "Fos"   "Egr1"  "Zfp36" "Dusp1"
##  [9] "Cyr61" "Id1"  
## [1] ""
## [1] ""
## [1] "PC9"
##  [1] "Grik1"    "Nnat"     "Cacna2d2" "Vsx1"     "Pcp4l1"   "Igfn1"   
##  [7] "Slitrk6"  "Fosb"     "Dner"     "Gng2"    
## [1] ""
##  [1] "Tmem200a" "Areg"     "Gngt1"    "Igfbp4"   "Adarb2"   "Cngb3"   
##  [7] "Cabp2"    "Cxcl12"   "Ryr3"     "Pcdh10"  
## [1] ""
## [1] ""
## [1] "PC10"
##  [1] "Igfn1"  "Gnat3"  "Gsg1"   "Neto1"  "Fn1"    "Sox5"   "Ebf1"  
##  [8] "Chrnb3" "Lamp5"  "Fmo2"  
## [1] ""
##  [1] "Cck"           "Lect1"         "Spock3"        "Epha7"        
##  [5] "A730046J19Rik" "Gjd2"          "Meis2"         "Nfia"         
##  [9] "BC046251"      "Camk4"        
## [1] ""
## [1] ""
## [1] "PC11"
##  [1] "Atp1b1"  "Sparcl1" "Snhg11"  "Tfap2b"  "Spock3"  "Pax6"    "Lamp5"  
##  [8] "Igfbp7"  "Uaca"    "Cplx2"  
## [1] ""
##  [1] "BC046251"     "Sox6"         "Cdh9"         "Epha7"       
##  [5] "Medag"        "Cplx4"        "Irx5"         "Nfia"        
##  [9] "Fezf2"        "RP23-124H2.4"
## [1] ""
## [1] ""
## [1] "PC12"
##  [1] "RP23-124H2.4" "Nxph1"        "Wls"          "Pcdh10"      
##  [5] "Fn1"          "Ncam2"        "Magi3"        "Camk4"       
##  [9] "Nnat"         "Cacna2d1"    
## [1] ""
##  [1] "Wfdc2"         "Pcdh17"        "Mylk"          "A930003A15Rik"
##  [5] "Slit2"         "Chrna6"        "Fezf2"         "Rpgrip1"      
##  [9] "Postn"         "Bhlhe22"      
## [1] ""
## [1] ""
## [1] "PC13"
##  [1] "Wfdc1"   "Itm2a"   "Aldh1a1" "Igfbp7"  "Mdk"     "Gpc3"    "Dapl1"  
##  [8] "Dclk1"   "Col9a1"  "Hopx"   
## [1] ""
##  [1] "Xist"   "Cplx2"  "Atp1b1" "Lrrtm1" "Spock3" "Cxcl14" "Cacng4"
##  [8] "Fosb"   "Tfap2b" "Slc6a1"
## [1] ""
## [1] ""
## [1] "PC14"
##  [1] "Neto1"   "Ebf1"    "Lamp5"   "Slitrk6" "Tacr3"   "Rcvrn"   "Sox6"   
##  [8] "Pcp4l1"  "Nfib"    "Bhlhe22"
## [1] ""
##  [1] "Erbb4"   "Rnd3"    "Irx5"    "Gsg1"    "Fam161a" "Gpr22"   "Pcdh7"  
##  [8] "Slit2"   "Wls"     "Aipl1"  
## [1] ""
## [1] ""
## [1] "PC15"
##  [1] "BC046251" "Sox6"     "Atp1b1"   "Filip1l"  "Snhg11"   "Ptprt"   
##  [7] "Zfml"     "Tfap2b"   "Arhgap20" "Kcnb2"   
## [1] ""
##  [1] "Lrrtm1" "Pcdh7"  "Xist"   "Cxcl14" "Meis2"  "Ptgds"  "Cplx2" 
##  [8] "Vsx1"   "Itm2a"  "Rbp1"  
## [1] ""
## [1] ""

By plotting the top genes assiociated with principle components.

VizPCA(object = shekhar,pcs.use = 1:9)

VizPCA(object = shekhar,pcs.use = 10:18)

VizPCA(object = shekhar,pcs.use = 19:27)

By plotting the output of principle components against each other.

PCAPlot(object = shekhar, dim.1=1,dim.2=2)

PCAPlot(object = shekhar, dim.1=1,dim.2=15)

PCAPlot(object = shekhar, dim.1=1,dim.2=30)

Finally the principle components will be visualized by use of heatmaps.

shekhar=ProjectPCA(object = shekhar,do.print = FALSE)
#PCHeatmap(object = shekhar,pc.use = 1,cells.use = 500,do.balanced = TRUE,label.columns = FALSE)
PCHeatmap(object = shekhar,pc.use = 1:9,cells.use = 400,do.balanced = TRUE,label.columns = FALSE,use.full = FALSE)

PCHeatmap(object = shekhar,pc.use = 10:18,cells.use = 400,do.balanced = TRUE,label.columns = FALSE,use.full = FALSE)

PCHeatmap(object = shekhar,pc.use = 19:27,cells.use = 400,do.balanced = TRUE,label.columns = FALSE,use.full = FALSE)

As determining the true dimensionality of the dataset and how many PCs to include has a great influence on the clustering and identification of cell types, many approaches to analyze each PC must be considered. Jack Straw plots allow for the visualization of gene p-values within the plotted PC. By comparing the distribution of p-values between the 30 PCs computed, a more informed decision on which PCs to include downstream can be made. Significant PCs should have distribution curves above the dashed line, or null distribution. A plot of the standard deviation for each PC will also be reviewed, to determine if there is any prominent feature that can be used to determine a cut off PC.

shekhar=JackStraw(object = shekhar,num.replicate = 100,do.print = FALSE,num.pc = 30)
JackStrawPlot(object = shekhar,PCs = 1:30,nCol = 5)

PCElbowPlot(object = shekhar,num.pc = 30)

Comparing the distribution of p-values for each PC, the last significant PC appears to be PC 21, with PC 22 and onward beginning to touch and fall under the null distribution. Also comparing the standard deviations seems to show that beyond PC 19-23 the change in standard deviation becomes miniscule. As the other methods previously implemented did not present an imediately apparent significant PC cut-off point, the analysis will continue using 21 PCs.

Cell Clustering and Identification

Cell clustering is implemented in Seurat by a shared nearest neighbor (SNN) modularity optimization based algorithm. This algorithm is capable of finding clusters of varying shapes, sizes, and densities in high dimensional data. In short it would compute a similarity matrix based on the imputed PCs, determine core points, and form clusters based on the core points. It is expected that cells of the same cell type would cluster together. To visualize the clusters t-distributed Stochastic neighbor Embedding (tSNE) dimensionality reduction will be implemented, this is a machine learning algorithm, non-linear dimensionality redution technique that is capable of embedding high dimensional data into two or three dimensions which can then be plotted.

shekhar=FindClusters(object = shekhar,reduction.type = "pca",dims.use = 1:21,resolution = 0.6,print.output = 0,save.SNN = TRUE)
shekhar=RunTSNE(object = shekhar,dims.use = 1:21,do.fast = TRUE)
PrintFindClustersParams(object = shekhar)
## Parameters used in latest FindClusters calculation run on: 2017-12-08 11:36:30
## =============================================================================
## Resolution: 0.6
## -----------------------------------------------------------------------------
## Modularity Function    Algorithm         n.start         n.iter
##      1                   1                 100             10
## -----------------------------------------------------------------------------
## Reduction used          k.param          k.scale          prune.SNN
##      pca                 30                25              0.0667
## -----------------------------------------------------------------------------
## Dims used in calculation
## =============================================================================
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
TSNEPlot(object = shekhar,do.label = T)

For each of the clusters generated, differentially expressed genes are identified to determine potential markers.

library(dplyr)
shekhar.markers=FindAllMarkers(object = shekhar,only.pos = TRUE,min.pct = 0.25,thresh.use = 0.25)
shekhar.markers %>% group_by(cluster) %>% top_n(3,avg_logFC)
## # A tibble: 54 x 7
## # Groups:   cluster [18]
##    p_val avg_logFC pct.1 pct.2 p_val_adj cluster  gene
##    <dbl>     <dbl> <dbl> <dbl>     <dbl>  <fctr> <chr>
##  1     0 0.9364673 0.997 0.926         0       0 Calm1
##  2     0 0.8575562 0.992 0.655         0       0  Pcp2
##  3     0 0.7930499 0.952 0.539         0       0  Chgb
##  4     0 0.7367398 0.986 0.662         0       1  Pcp2
##  5     0 0.7327898 0.953 0.546         0       1  Chgb
##  6     0 0.7200614 0.994 0.928         0       1 Calm1
##  7     0 3.0025385 1.000 0.139         0       2  Apoe
##  8     0 2.7202980 0.997 0.212         0       2  Glul
##  9     0 2.1316671 0.996 0.325         0       2 Acsl3
## 10     0 0.8965868 0.872 0.424         0       3   Hlf
## # ... with 44 more rows

These markers will be compared to the markers identified by the Shekhar et. al. paper to identify clusters that exhibit the same gene expressions.

Results

Describe your results and include relevant tables, plots, and code/comments used to obtain them. End with a brief conclusion of your findings related to the question you set out to address. You can include references if you’d like, but this is not required.

The goal of this project is to correctly identify the cell types of the 25,000 bipolar cells presented by Shekhar et al. The identification of the cells will be done by comparing the top differentially expressed genes, or markers, for each cluster to known bipolar cell markers. The accuracy of this cell identification process will be checked by comparison to the clusters and differentially expressed genes of the same clusters found by Shekhar et al. As they originally analyzed this data, this analysis would be expected to at least partially match cell clusters, even though the methods used in each analysis are different than one another.

The shekhar paper utilized three different principle component analysis and clustering tenchinques/analysis to arrive at three different clustering possibility for the data. These analyses lead to the formation of 26, 31, and 29 clusters, of which the Louvain-Jaccard generated clustering of 26 was selected as the best representation. The analysis presented in this project yeilded 18 clusters. Shekhar et al. was able to identify 14 bipolar cell cluster amoung their 26 by checking for the expression of Prkca, which is indicitive of bipolar rod cell types, and Scgn, which is indicitive of bipolar cone cell types.

VlnPlot(object = shekhar,features.plot = c("Prkca")) #Bipolar Rod Cells

FeaturePlot(object = shekhar, features.plot = c("Prkca"), cols.use = c("grey", "blue"), reduction.use = "tsne")

VlnPlot(object = shekhar,features.plot = c("Scgn"))  #Bipolar Cone Cells

FeaturePlot(object = shekhar, features.plot = c("Scgn"), cols.use = c("grey", "blue"), reduction.use = "tsne")

VlnPlot(object = shekhar,features.plot = c("Isl1"))  #Bipolar On Cone Cells

FeaturePlot(object = shekhar, features.plot = c("Isl1"), cols.use = c("grey", "blue"), reduction.use = "tsne")

VlnPlot(object = shekhar,features.plot = c("Grm6"))  #Bipolar Off Cone Cells

FeaturePlot(object = shekhar, features.plot = c("Grm6"), cols.use = c("grey", "blue"), reduction.use = "tsne")

The above plots visualize the expression value of each of the 4 mentioned genes with the cells categorized into one of the 18 clusters, or by indicating the location of the cell on the tSNE plot. From these specific gene expressions we can conclude that: clusters 1, 2, and 4 correspond to rod bipolar cells; clusters 3, 5, 6, 7, and 14 correspond to ON-cone bipolar cells; and clusters 8, 9, 10, 11, and 12 correspond to OFF-cone bipolar cells. This identifies 13 total bipolar cell clusters to known markers. Cluster 16 may also be identifiable as ON-cone bipolar cells, as the known ON-cone bipolar markers show decent expression in those cells. Including cluster 16 that brings the total identified bipolar clusters from this analysis to 14, matching those completed by the literature analysis.

shekhar2=shekhar
current.cluster.ids <-c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17)
new.cluster.ids <-c("Rod BCs", "Rod BCs",2,"ON Cone BC1","Rod BCs","ON Cone BC2","ON Cone BC3","ON Cone BC4","OFF Cone BC1","OFF Cone BC2","OFF Cone BC3","OFF Cone BC4","OFF Cone BC5",14,"ON Cone BC5",15,16,17)
shekhar2@ident <-plyr::mapvalues(shekhar2@ident, from = current.cluster.ids, to = new.cluster.ids)
TSNEPlot(object = shekhar2,do.label = T)

There are currently 11 known and identified bipolar cell types, 1 rod and 10 cone cell types, which seemingly match with the clusters identified by the 4 cell markers used. This leave 5 clusters unaccounted for which indicate other cell types may be present. These cells could be contamination in the form of other retinal cells, which may have been introduced to the population during cell preperation. Therefore, we check for a few other types of retinal cells closely tied to bipolar cells, namely photorecptors and ganglion cells. Rax is reported to be a general marker for photorecptors, while Rho is reported to mark rod photoreceptors specifically. Calb2, Map1a, and Map2 are reported to mark ganglion cells.

VlnPlot(object = shekhar,features.plot = c("Calb2")) 

FeaturePlot(object = shekhar, features.plot = c("Calb2"), cols.use = c("grey", "blue"), reduction.use = "tsne")

VlnPlot(object = shekhar,features.plot = c("Map1a")) 

FeaturePlot(object = shekhar, features.plot = c("Map1a"), cols.use = c("grey", "blue"), reduction.use = "tsne")

VlnPlot(object = shekhar,features.plot = c("Map2")) 

FeaturePlot(object = shekhar, features.plot = c("Map2"), cols.use = c("grey", "blue"), reduction.use = "tsne")

VlnPlot(object = shekhar,features.plot = c("Rax"))

FeaturePlot(object = shekhar, features.plot = c("Rax"), cols.use = c("grey", "blue"), reduction.use = "tsne")

VlnPlot(object = shekhar,features.plot = c("Rho")) 

FeaturePlot(object = shekhar, features.plot = c("Rho"), cols.use = c("grey", "blue"), reduction.use = "tsne")

The expression data for the markers check do not seem to identify any ganglion cells, however clusters 2 and 14 show high expression for phtoreceptor cell markers. With these 2 additional clusters being identified that leaves 3 clusters with unassigned cell types. These clusters may be contamination of some other sort, or may correlate to an undiscovered type of bipolar cell.

shekhar3=shekhar
current.cluster.ids <-c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17)
new.cluster.ids <-c("Rod BCs", "Rod BCs","Photorecptors","ON Cone BC1","Rod BCs","ON Cone BC2","ON Cone BC3","ON Cone BC4","OFF Cone BC1","OFF Cone BC2","OFF Cone BC3","OFF Cone BC4","OFF Cone BC5","Cone Photorecptors","ON Cone BC5",15,16,17)
shekhar3@ident <-plyr::mapvalues(shekhar3@ident, from = current.cluster.ids, to = new.cluster.ids)
TSNEPlot(object = shekhar3,do.label = T)

table(shekhar3@ident)
## 
##            Rod BCs      Photorecptors        ON Cone BC1 
##              11147               3089               2198 
##        ON Cone BC2        ON Cone BC3        ON Cone BC4 
##               1872               1768               1707 
##       OFF Cone BC1       OFF Cone BC2       OFF Cone BC3 
##               1254               1115                842 
##       OFF Cone BC4       OFF Cone BC5 Cone Photorecptors 
##                553                528                525 
##        ON Cone BC5                 15                 16 
##                374                284                184 
##                 17 
##                 47

Looking at the counts for each cell type, we can see that the remaining unidentified cells are also the lowest in cell number, which may also point to unoptimized or incomplete clustering. Keeping in mind the goals of this exercise, I am fairly convinced that all 11 known bipolar cell types have been accurately identified by the methodology implemented here. Each individual cell type cluster was not investigated individually for unique features, which may have been an ambitious goal for the scope of this project. Photorecptor cells were found within the data, which is to be expected due to the functional connection they share with bipolar cells, however not finding any ganglion cells is somewhat unexpected. With more literature review, more cell markers for various cell types could be discovered and tested on the data to gain a better understanding on the unidentified clusters. This in turn would help to confirm or deny the present of yet to be classified bipolar cell types within the 25,000 cells investigated.